Data Spaces thesis: Analysis of Mobile-Price-Classification dataset



Prof. Francesco Vaccarino


Student: s246996 Eugenio Emmolo

1 - Introduction and Dataset description

The analyzed dataset contains information about mobile phones: each sample is described by 20 features (or predictors) plus a label that indicates the price range of the phone.

The dataset includes a total of 2000 labelled samples.

The price range is an integer value in $[0, 3]$:

  • Zero refers to a low price category
  • One refers to a medium price category
  • Two refers to a high price category
  • Three refers to a very high price category

The dataset can be found at https://www.kaggle.com/iabhishekofficial/mobile-price-classification#train.csv

Here is an extract of the dataset:
battery_power bluetooth clock_speed dual_sim front_cam 4G int_memory depth weight n_cores primary_cam px_height px_width ram screen_height screen_width talk_time 3G touch_screen wifi price_range
0 842 0 2.2 0 1 0 7 0.6 188 2 2 20 756 2549 9 7 19 0 0 1 1
1 1021 1 0.5 1 0 1 53 0.7 136 3 6 905 1988 2631 17 3 7 1 1 0 2
2 563 1 0.5 1 2 1 41 0.9 145 5 6 1263 1716 2603 11 2 9 1 1 0 2
3 615 1 2.5 0 0 0 10 0.8 131 6 9 1216 1786 2769 16 8 11 1 0 0 2
4 1821 1 1.2 0 13 1 44 0.6 141 2 14 1208 1212 1411 8 2 15 1 1 0 1
5 1859 0 0.5 1 3 0 22 0.7 164 1 7 1004 1654 1067 17 1 10 1 0 0 1
6 1821 0 1.7 0 4 1 10 0.8 139 8 10 381 1018 3220 13 8 18 1 0 1 3
7 1954 0 0.5 1 0 0 24 0.8 187 4 0 512 1149 700 16 3 5 1 1 1 0
8 1445 1 0.5 0 0 0 53 0.7 174 7 14 386 836 1099 17 1 20 1 0 0 0
9 509 1 0.6 1 2 1 9 0.1 93 5 15 1137 1224 513 19 10 12 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1990 1617 1 2.4 0 8 1 36 0.8 85 1 9 743 1426 296 5 3 7 1 0 0 0
1991 1882 0 2.0 0 11 1 44 0.8 113 8 19 4 743 3579 19 8 20 1 1 0 3
1992 674 1 2.9 1 1 0 21 0.2 198 3 4 576 1809 1180 6 3 4 1 1 1 0
1993 1467 1 0.5 0 0 0 18 0.6 122 5 0 888 1099 3962 15 11 5 1 1 1 3
1994 858 0 2.2 0 1 0 50 0.1 84 1 2 528 1416 3978 17 16 3 1 1 0 3
1995 794 1 0.5 1 0 1 2 0.8 106 6 14 1222 1890 668 13 4 19 1 1 0 0
1996 1965 1 2.6 1 0 0 39 0.2 187 4 3 915 1965 2032 11 10 16 1 1 1 2
1997 1911 0 0.9 1 1 1 36 0.7 108 8 3 868 1632 3057 9 1 5 1 1 0 3
1998 1512 0 0.9 0 4 1 46 0.1 145 5 5 336 670 869 18 10 19 1 1 1 0
1999 510 1 2.0 1 5 1 45 0.9 168 6 16 483 754 3919 19 4 2 1 1 1 3

2000 rows × 21 columns

1.1 - Tools and Libraries

I have chosen Python 3.7 as the programming language for this analysis because it is very intuitive and one of the most popular in the Machine Learning world. Also, Python offers a lot of powerful libraries that are plug-and-play and pretty easy to use. Compared to the R language, I find Python much simpler and more intuitive, which is why I have used it.

I wrote all the code in a Jupyter Notebook, which I have found very useful since it allows mixing code cells with markdown/text cells. On top of that, it provides good visualization and graphic tools.

Main Libraries:

  • Pandas: an open-source library which provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
  • NumPy: the fundamental package for scientific computing with Python.
  • Matplotlib: a plotting library for the Python programming language that I used to create explanatory graphs.
  • Scikit-Learn: an open-source library that provides machine-learning tools and algorithms.

Targets:

I. Explore the dataset by extracting some statistics and describe the features which characterize the samples.

II. Understand which preprocessing steps are necessary to make the dataset ready for further analysis.

III. Try some techniques to reduce the dimensionality of the dataset preserving as much information as possible.

IV. Apply the set of classification models and techniques explained in the lectures to classify the samples into the correct price range as accurately as possible.

V. Provide the dimensionality-reduced datasets to the classifiers and evaluate their performance, comparing it with the results obtained on the full dataset.

VI. Find out which features are the most discriminative in classifying a sample into the correct price range (label).

2 - Data Exploration

In this section I have analyzed the dataset, pointing out some information that I have considered relevant. To do this, I used boxplots, a heatmap and other measures.

2.1 - Features Description

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 21 columns):
battery_power    2000 non-null int64
bluetooth        2000 non-null int64
clock_speed      2000 non-null float64
dual_sim         2000 non-null int64
front_cam        2000 non-null int64
4G               2000 non-null int64
int_memory       2000 non-null int64
depth            2000 non-null float64
weight           2000 non-null int64
n_cores          2000 non-null int64
primary_cam      2000 non-null int64
px_height        2000 non-null int64
px_width         2000 non-null int64
ram              2000 non-null int64
screen_height    2000 non-null int64
screen_width     2000 non-null int64
talk_time        2000 non-null int64
3G               2000 non-null int64
touch_screen     2000 non-null int64
wifi             2000 non-null int64
price_range      2000 non-null int64
dtypes: float64(2), int64(19)
memory usage: 328.2 KB

As I said before, each mobile phone is described by 20 numerical features that are not only technical details about its hardware components but also values corresponding to the performance of the device. In more detail, the features are:

  • battery_power: [Numeric] Total energy that the battery can store, measured in mAh;
  • bluetooth: [Boolean] It specifies whether the device has bluetooth or not;
  • clock_speed: [Numeric] Speed at which the microprocessor executes instructions;
  • dual_sim: [Boolean] It specifies whether the device has dual SIM support or not;
  • front_cam: [Numeric] Front camera megapixels;
  • 4G: [Boolean] It specifies whether the device has 4G or not;
  • int_memory: [Numeric] Internal memory of the device in Gigabytes;
  • depth: [Numeric] Depth of the device in cm;
  • weight: [Numeric] Weight of the mobile phone;
  • n_cores: [Numeric] Number of processor cores;
  • primary_cam: [Numeric] Primary camera megapixels;
  • px_height: [Numeric] Pixel resolution height;
  • px_width: [Numeric] Pixel resolution width;
  • ram: [Numeric] Random Access Memory in Megabytes;
  • screen_height: [Numeric] Screen height of the mobile in cm;
  • screen_width: [Numeric] Screen width of the mobile in cm;
  • talk_time: [Numeric] Maximum call time that a single battery charge will support;
  • 3G: [Boolean] It specifies whether the device has 3G or not;
  • touch_screen: [Boolean] It specifies whether the device has a touch screen or not;
  • wifi: [Boolean] It specifies whether the device has wifi or not.

In summary, the dataset has a total of 14 numeric variables and 6 boolean variables that describe the characteristics of each mobile phone.

There are no NULL/missing values in the dataset.
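This check can be reproduced with pandas (a minimal sketch; the tiny DataFrame below stands in for the real train.csv, which in the notebook is loaded with `pd.read_csv`):

```python
import pandas as pd

# Hypothetical miniature of the Kaggle dataset; in the real analysis:
# df = pd.read_csv("train.csv")
df = pd.DataFrame({
    "battery_power": [842, 1021, 563],
    "ram": [2549, 2631, 2603],
    "price_range": [1, 2, 2],
})

# Count the missing values in every column and sum them up
n_missing = df.isnull().sum().sum()
print(n_missing)  # 0 -> no NULL/missing values
```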

2.2 - Some statistics about the dataset

battery_power clock_speed front_cam int_memory depth weight n_cores primary_cam talk_time px_height px_width ram screen_height screen_width
mean 1238.518500 1.522250 4.309500 32.046500 0.501750 140.249000 4.520500 9.916500 11.011000 645.108000 1251.515500 2124.213000 12.306500 5.767000
std 439.418206 0.816004 4.341444 18.145715 0.288416 35.399655 2.287837 6.064315 5.463955 443.780811 432.199447 1084.732044 4.213245 4.356398
min 501.000000 0.500000 0.000000 2.000000 0.100000 80.000000 1.000000 0.000000 2.000000 0.000000 500.000000 256.000000 5.000000 0.000000
25% 851.750000 0.700000 1.000000 16.000000 0.200000 109.000000 3.000000 5.000000 6.000000 282.750000 874.750000 1207.500000 9.000000 2.000000
50% 1226.000000 1.500000 3.000000 32.000000 0.500000 141.000000 4.000000 10.000000 11.000000 564.000000 1247.000000 2146.500000 12.000000 5.000000
75% 1615.250000 2.200000 7.000000 48.000000 0.800000 170.000000 7.000000 15.000000 16.000000 947.250000 1633.000000 3064.500000 16.000000 9.000000
max 1998.000000 3.000000 19.000000 64.000000 1.000000 200.000000 8.000000 20.000000 20.000000 1960.000000 1998.000000 3998.000000 19.000000 18.000000

The table above shows some statistical measures for each numerical predictor variable of the dataset:

  • Mean is the average value of the feature over all the samples;
  • Std is the standard deviation of the feature, describing how far the data spread from the mean value;
  • Max and Min report the maximum and minimum values of the feature in the dataset;
  • 25% and 75% represent the lower and upper quartiles, below which the specified percentage of the data falls;
  • 50% represents the median.

For example, considering the battery_power variable:

  • The average value is 1238.51 mAh;
  • Its standard deviation is 439.42;
  • The minimum value in the whole dataset is 501 mAh, while the maximum value is 1998 mAh;
  • The 25th percentile is at 851.75 mAh, meaning that 25% of the entire dataset falls in the range $[501, 851.75]$ mAh;
  • The same holds for the 50th and 75th percentiles.

The dataset presents numerical features with very different units of measure: for example, "battery_power" is three orders of magnitude bigger than "clock_speed", while "depth" is three orders of magnitude smaller than "px_width" and "ram", and so on.

For this reason I have decided to standardize the values of the dataset, so that features with higher absolute values do not have a disproportionate impact.
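A minimal sketch of this standardization with scikit-learn's StandardScaler (the toy matrix below mixes a battery_power-scale column with a clock_speed-scale column):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with very different scales (battery_power vs clock_speed)
X = np.array([[842.0, 2.2],
              [1021.0, 0.5],
              [563.0, 0.5],
              [1821.0, 1.2]])

scaler = StandardScaler()          # z = (x - mean) / std
X_std = scaler.fit_transform(X)

# After standardization every column has mean 0 and unit variance
print(X_std.mean(axis=0).round(6))
print(X_std.std(axis=0).round(6))
```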

2.3 - Sample Class distribution

The dataset is perfectly balanced, as it includes 500 samples for each price category, for a total of 2000 samples. Given this already balanced structure, I have considered the employment of any balancing technique unnecessary.
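The balance check itself is a one-liner with pandas (a sketch with 2 samples per class standing in for the 500 of the real dataset):

```python
import pandas as pd

# Hypothetical miniature of the label column: 500 samples per class in the
# real dataset, 2 per class here.
y = pd.Series([0, 1, 2, 3] * 2, name="price_range")

# value_counts gives the number of samples per class; if all counts are
# identical, the dataset is perfectly balanced.
counts = y.value_counts()
print(counts.nunique() == 1)  # True -> all classes equally represented
```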

[Figure: bar chart of the number of samples per price category]

2.4 - BoxPlots

In general, a boxplot is a simple way to show the distribution of values assumed by some data and the range where data are most likely to be found. It consists of a box and a pair of whiskers: the box marks the range from the 25% to the 75% percentile, identified in the picture by $Q1$ and $Q3$, while the whiskers point respectively to $Q1 - 1.5 \cdot IQR$ and $Q3 + 1.5 \cdot IQR$.
Values outside the whiskers are marked as outliers because they fall outside a region that collects more than 99% of all the values according to the distribution. Moreover, the 50% percentile coincides with the median and is marked with a dash in the figure (recall that the median is the central value of the data, splitting them into equal numbers to its right and to its left).

In the following plots I have displayed, for each numerical feature, the relative boxplot highlighting its value distribution.

[Figure: boxplots of the value distribution of each numerical feature]

There are no outliers except for the 'pixel_height' and 'front_camera' variables. The boxplot marks as outliers the points that exceed the whiskers, because they lie outside the range that contains more than 99% of the probability of finding a new sample. A small subset of the features present an asymmetric value distribution; they are:

  • Front camera
  • Core number
  • Pixel height
  • Screen width

Their mean value does not coincide with their median and they have a high variance: the points are likely to fall quite far from the expected value. Their distribution is not centered on the mean value but is shifted higher or lower.

2.5 - Boxplot of values assumed by the features separately for each price category

I have plotted, by means of boxplots, the distribution of values assumed by each numerical feature separately for each price category.

These box plots show the values assumed by the features over the whole dataset, separately for each price category, so I can see the value distribution of each feature across all the classes.

From these pictures some interesting information can be pointed out:

  • Looking at the plot of battery_power, it seems that the higher the price category, the higher the mean value of battery_power.
  • On the contrary, the weight of the device tends to be lower if the device belongs to a high price category; this is usually true in real life: top-level mobile phones tend to be made as light as possible;
  • The primary camera seems to have more megapixels in devices that belong to the top price range;
  • The screen resolution increases as the price range of the device grows (a positive trend is shown by the boxplot);
  • The mean value of RAM installed in the devices increases significantly as the price range of the device grows;
  • On the contrary, features like clock_speed or screen_height have values which are almost constant and independent of the price category: no matter the price of the device, those features will have nearly the same values.

In summary, I can say that features describing hardware components of a device show a positive, growing trend from class 0 (LOW) to class 3 (VERY-HIGH), and this is a reasonable consequence of the fact that devices with high-quality specs (represented in the dataset by features with higher values) are more expensive than devices with low-quality specs.

2.6 - Correlation Matrix

I have plotted the correlation matrix of the features of the dataset. The matrix is symmetric and shows the degree of correlation shared by each pair of variables. I used the Pearson correlation coefficient to measure the relationship between the feature variables: it is a value that describes the mutual variability of two variables.

Given two random variables $ X, Y $, the Pearson correlation coefficient is defined by:

$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}$

where $\sigma_{XY}$ is the covariance of $X, Y$ and $\sigma_X, \sigma_Y$ are the two standard deviations.

Such correlation index is a number in the range $[-1 , 1]$:

  • A value close to $+1$ means that the two variables are positively correlated; hence, for every positive increase in one variable, there is a positive increase of a fixed proportion in the other.
  • A value close to $-1$ means that the two variables are negatively correlated; hence, for every positive increase in one variable, there is a decrease of a fixed proportion in the other.
  • A value near zero means that the two variables can be considered uncorrelated, and so there is no linear relationship between them at all.
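As a sketch, the Pearson correlation matrix can be computed directly with pandas' `DataFrame.corr` (the synthetic columns below mimic the positively correlated px_height/px_width pair and an unrelated column):

```python
import numpy as np
import pandas as pd

# Toy data: px_width roughly proportional to px_height (positive correlation),
# plus an unrelated clock_speed column.
rng = np.random.default_rng(0)
px_height = rng.uniform(0, 1960, size=200)
df = pd.DataFrame({
    "px_height": px_height,
    "px_width": 0.7 * px_height + rng.normal(0, 50, size=200),
    "clock_speed": rng.uniform(0.5, 3.0, size=200),
})

# Pearson correlation matrix (pandas' default method is "pearson")
corr = df.corr()
print(corr.loc["px_height", "px_width"].round(2))  # close to +1
```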
[Figure: correlation matrix heatmap of the features]

Correlations found in the dataset:

  1. The first thing that catches the eye is the high correlation between RAM and the price category: high values of RAM are positively correlated with a high price range. This seems reasonable, since expensive smartphones usually have a lot of RAM available.

  2. Also, the pixel resolution shows a slight correlation with the price category; this could mirror reality because, usually, top-level phones have a high pixel density and so a powerful display (which is clearly costly). Furthermore, pixel resolution height and width are positively correlated: if the height resolution increases, the width increases proportionally as well, which seems reasonable.

  3. The dimensions of the screen (width and height measured in centimeters) are positively correlated; this makes sense since the standard device shape is rectangular, so the height cannot grow while the width remains unaltered.

  4. Other significant relationships are the one that links 3G and 4G technologies and the one that exists between primary_cam and front_cam features.

  5. There are no strong negative correlations between features in the dataset, just some light ones around $-0.1$.

Preprocessing analysis and steps performed:

After analyzing the dataset composition and its features, I have:

  1. Split data into a column array for the response labels $(Y) \; [2000x1]$ and a matrix $(X)\; [2000x20]$ that contains the feature samples.
  2. Data standardization: to bring feature values in a standard form: $\mu = 0$ and $\sigma = 1$.
  3. No balancing techniques have been applied because the dataset sample distribution among classes is already balanced.
  4. No need to fill value gaps because there are no NULL/missing values.
  5. All features in the dataset are numerical (no strings are present) and so they are ready to be processed by algorithms.

3 - Dimensionality Reduction

3.1 - Feature manipulation

Since the screen size is described by means of two variables (screen_height and screen_width), I chose to merge them into a single variable, the diagonal of the screen; this is possible because the two features have been standardized and have the same scale. To do that, I simply applied the Pythagorean theorem and calculated the diagonal from the height and the width (assuming a rectangular shape of mobile phones).
This way I have reduced the dimensionality of the dataset by one (20 --> 19 features).
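The merge can be sketched as follows (toy values; in the notebook the operation is applied to the standardized columns):

```python
import numpy as np
import pandas as pd

# Toy frame with the two screen dimensions
df = pd.DataFrame({"screen_height": [9.0, 17.0, 12.0],
                   "screen_width": [7.0, 3.0, 5.0]})

# Pythagoras: diagonal = sqrt(height^2 + width^2)
df["screen_diag"] = np.hypot(df["screen_height"], df["screen_width"])
df = df.drop(columns=["screen_height", "screen_width"])

print(df.columns.tolist())  # ['screen_diag']
```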

3.2 - PCA

Principal Component Analysis is an unsupervised dimensionality reduction algorithm that summarizes a large feature set into a smaller number of features that together explain as much of the variability of the original dataset as possible. In other words, PCA finds a lower-dimensional representation of the data that captures as much of the information as possible.
The idea behind PCA is that not all the dimensions of the original dataset are equally interesting; PCA simply looks for a small number of features that are as interesting as possible and discards the others. The concept of "interesting" is related to the amount of variance explained.

The so-called principal components are nothing but the directions in space that contain the maximum variance of the considered dataset; PCA finds a number of PCs (directions) which is smaller than the number of original features and then projects the data onto the new lower-dimensional subspace. These directions are orthogonal to each other and they are as close as possible to the cloud of data.

More in detail, the first PC of a set of features ${X_1,X_2, \dots , X_p}$, is the normalized linear combination of the features: $Z_1 = \beta_{{11}}X_1 + \beta_{{21}}X_2 + \dots + \beta_{{p1}}X_p $ that has the largest variance. The beta coefficients are the loadings of the first PC and together they constitute the PC Loading vector $\beta_1 = \{\beta_{{11}}, \beta_{{21}}, \dots , \beta_{{p1}}\}^T $.

Normalized means that $\sum_{j=1}^{p} \beta_{j1}^2 = 1$.
This constraint is strictly necessary, otherwise the variance along the considered direction could become arbitrarily large as the coefficients grow in absolute value. Hence, the new orthonormal basis must be contained in a hypersphere with radius 1.

The second PC $Z_2 = \beta_{12}X_1 + \beta_{22}X_2 + \dots + \beta_{p2}X_p$ is the linear combination of $\{X_1, X_2, \dots, X_p\}$ that has maximal variance in the residual space, that is, out of all linear combinations that are uncorrelated with $Z_1$. And so on and so forth for all the other PCs.

Note that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\beta_2$ to be orthogonal to the direction $\beta_1$.

The first principal component loading vector solves the following optimization problem:

$\underset{\beta_{11}, \dots, \beta_{p1}}{\text{maximize}} \quad \Bigl\{ \frac{1}{n} \sum_{i=1}^{n}\Bigl(\sum_{j=1}^{p}\beta_{j1} x_{ij}\Bigr)^2 \Bigr\} \qquad$ subject to: $\quad \sum_{j=1}^{p} \beta_{j1}^2 = 1$

PCA is unsupervised because there is no associated response (Y) needed.

PCA is also a great tool for data visualization: selecting the first two PC, which are the most informative and representative, you can produce a scatter plot of a much higher-dimensional dataset which cannot be visualized otherwise.

If you want to apply PCA and reduce the dimensionality of a dataset $D$ with $n$ features to a new one with only $k$ features, these steps are required:

  1. Standardize the data
  2. Build the Covariance matrix $\Sigma$ and decompose it into Eigenvectors and Eigenvalues;
  3. Sort the Eigenvalues in decreasing order to obtain a ranking of the correspondent Eigenvectors;
  4. Select the top $k$ eigenvectors according to the ranking of the eigenvalues; $k$ will be the new dimensionality of the reduced dataset;
  5. Build a new projection matrix $W$ using the selected top $k$ eigenvectors;
  6. Transform the initial dataset $D$ by multiplying it by $W$:

    $D_{transformed} = D*W$
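The six steps above can be sketched from scratch with NumPy (random data stands in for the real feature matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(200, 6))           # stand-in for the real data matrix

# 1. standardize the data
D = (D - D.mean(axis=0)) / D.std(axis=0)
# 2. covariance matrix and its eigendecomposition (eigh: cov is symmetric)
cov = np.cov(D, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
# 3-4. sort eigenvalues in decreasing order and keep the top k eigenvectors
k = 4
order = np.argsort(eigvals)[::-1]
# 5. projection matrix W built from the top k eigenvectors (6 x 4)
W = eigvecs[:, order[:k]]
# 6. project the dataset onto the new subspace
D_transformed = D @ W

print(D_transformed.shape)  # (200, 4)
```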

[Figure: bar chart of the variance explained by each principal component]

We can see from the plots above that the first 4 PCs explain 100% of the total variance of the data; hence, from the 5th PC on, the explained variance is zero. I can therefore reduce the dimensionality of my dataset to 4 features, coinciding with the first 4 PCs.

Hence, the size of the dataset is reduced from $ 2000x20 $ to $ 2000x4 $.

The following table shows the pruned dataset.

Dataset made by the first 4 Principal Components:
PC 0 1 2 3
Samples
0 0.567048 -1.047966 -0.513679 0.073266
1 0.665010 0.917376 -0.310298 0.452913
2 0.623323 1.006028 -0.895563 -0.150015
3 0.842575 1.026767 -0.830672 -0.040036
4 -0.946824 0.503453 0.778334 -0.516690
5 -1.396626 0.707059 0.806715 0.094987
6 1.445824 -0.446547 0.771686 0.030193
7 -1.874172 -0.222487 0.942272 0.066829
8 -1.347255 -0.630160 0.282685 -0.142795
9 -2.127808 0.420890 -0.950903 -0.486735
... ... ... ... ...
1990 -2.408654 0.241164 0.491801 0.111601
1991 1.923134 -1.051957 0.856026 0.108579
1992 -1.241512 0.423718 -0.767066 0.580321
1993 2.416565 0.112515 0.312586 -0.374349
1994 2.443039 0.048059 -0.508324 0.228859
1995 -1.924101 1.111208 -0.600521 0.082018
1996 -0.124375 0.913838 0.932865 0.467268
1997 1.225588 0.575048 0.874804 0.179792
1998 -1.649716 -0.829488 0.376349 -0.250766
1999 2.365241 -0.600484 -0.940926 -0.375895

2000 rows × 4 columns

3.2.1 - Data visualization with PCA

In the following plot I represent the data in 2D using the PCA result: I selected the first two principal components, which are the most informative and explain about 83% of the total variance, and I displayed a scatter plot that also shows the class distributions with different colors.

The dataset is reduced from $ 2000x20 $ to $ 2000x2 $.

The class distributions overlap a little, but each class can still be easily distinguished; it seems that points belonging to low price categories tend to lie on the left side of the picture, while points of high price categories lie on the right side.
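A sketch of this 2D visualization using scikit-learn's PCA and matplotlib (synthetic, class-shifted data stands in for the phone dataset):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")        # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized feature matrix and labels:
# four classes, shifted apart so the scatter separates them.
rng = np.random.default_rng(2)
y = np.repeat([0, 1, 2, 3], 50)
X = rng.normal(size=(200, 19)) + y[:, None] * 0.8

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)           # project onto the first two PCs

# One colored scatter per price category
fig, ax = plt.subplots()
for label in np.unique(y):
    ax.scatter(X_2d[y == label, 0], X_2d[y == label, 1],
               label=f"class {label}", s=10)
ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.legend()
plt.close(fig)

print(X_2d.shape)  # (200, 2)
```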

3.3 - Feature Selection Exploiting Random Forest Classifier

As the section title says, I have exploited the Random Forest classifier, fitted on the data, to extract a measure of "importance" for each feature; the algorithm assigns an importance value to each feature: the higher the value, the more important the algorithm considers that feature for the classification task.

In scikit-learn, the importance coefficient is called “gini importance” or “mean decrease impurity” and is defined as the total decrease in node impurity weighted by the probability of reaching that node (which is approximated by the proportion of samples reaching that node) averaged over all trees of the ensemble.

More in detail:

The importance of a variable $X_m$ for predicting $y$ can be evaluated by adding up the weighted impurity decreases $p(t)\Delta i(s_t, t)$ for all nodes $t$ where $X_m$ is used, averaged over all $N_T$ trees in the forest.

Impurity formula:

$Imp(X_m) = \frac{1}{N_T} \sum_{T} \sum_{t \in T : v(s_t) = X_m} p(t)\Delta i(s_t, t)$

where $p(t)$ is the proportion $\frac{N_t}{N}$ of samples reaching node $t$ and $v(s_t)$ is the variable used in split $s_t$.

Steps followed:

  1. Start with an empty dataset;
  2. At each iteration, add the most important remaining feature according to the Random Forest feature importance metric, from the highest importance to the lowest (incrementing the dataset's columns by 1); then train a new Random Forest classifier using a 10-split K-Fold cross-validation and evaluate its performance by looking at the F-Measure;
  3. Select the number of features that provides the best trade-off between the number of features used and the performance obtained.
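The steps above can be sketched as follows (synthetic data; 5-fold CV and 50 trees here just to keep the sketch fast, versus the 10 folds used in the actual analysis):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 4-class stand-in for the phone dataset
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=4, random_state=0)

# Rank the features by Random Forest gini importance
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]

# Add one feature at a time (most important first) and cross-validate
scores = []
for k in range(1, X.shape[1] + 1):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    f1 = cross_val_score(clf, X[:, ranking[:k]], y,
                         cv=5, scoring="f1_macro").mean()
    scores.append(f1)

best_k = int(np.argmax(scores)) + 1
print(best_k, round(max(scores), 3))
```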

In section 4.2 - Random Forest I will explain the Random Forest classifier in more detail, along with the Gini impurity index.

Comments about the obtained results:

The plot above shows the trend of the model's performance as I progressively add one feature at each iteration. At first, when I add the most important features according to the Random Forest Gini importance, the F-Measure grows and the model becomes more effective at recognizing the classes, but from the 5th feature on, the performance slowly decreases, albeit in a non-linear way.
Therefore I think a reasonable choice is to keep the first 5 features that have been added, because they are a good trade-off between the number of features kept and the performance obtained. The resulting pruned dataset has a shape of (2000, 5), keeping the 5 most important features according to the Random Forest metric.

I know that there are many other ways to combine the features which might give better results, but I considered an exhaustive exploration of the entire solution space out of the scope of this analysis.

4 - Classification

In this section I explain the working principles of some of the classification algorithms we have studied in the lectures and apply them to the "cleaned" dataset, trying to understand which algorithm works best in classifying a sample into the correct price category.

For each classifier employed I have followed these steps:

  1. As a preliminary step, I have searched for the best hyperparameter setting of each algorithm: I have trained several models with different hyperparameter values and picked the setting that gives the best outcome according to the F-Measure. I have chosen the F-Measure because it estimates how good the model is at recognizing the different classes; I could also have used accuracy as a metric, as the dataset is perfectly balanced.

  2. For each algorithm I have split the dataset into a training set and a test set with proportions 70% training and 30% test: the training set is used to build the model, while the test set is used to evaluate its performance.
    I have chosen to keep track of the accuracy, precision, recall and F-Measure of the model.

  3. I have performed a K-fold cross-validation on the training set with K = 10 in order to create a robust model that is as little biased as possible. Once the model is trained, I evaluate its performance once more on the test set I set apart before.

  4. I repeated steps 2 and 3 using a dataset reduced with one of the techniques described before (PCA and Random Forest-based feature selection).

The classification algorithms that I have implemented are:

  • Decision Tree
  • Random Forest
  • Support Vector Machine
  • K Nearest Neighbor

Evaluation parameters: Confusion Matrix, Accuracy, Precision, Recall and F-Measure

Confusion Matrix

The confusion matrix helps describe the performance of a classification algorithm on a test set for which the true labels are known. It summarizes the classifier predictions by showing the count of both correct and wrong predictions separately for each class.
On the columns we find the ground truth (correct labels), while on the rows we find the model's predictions. For a simple binary classification, there are four possible kinds of prediction:

  • True Positive (TP) : Observation is positive, and is predicted to be positive.
  • False Negative (FN) : Observation is positive, but is predicted negative.
  • True Negative (TN) : Observation is negative, and is predicted to be negative.
  • False Positive (FP) : Observation is negative, but is predicted positive.

Accuracy

It is defined by the relation: $\frac{TP + TN}{TP + TN + FP + FN}$
It gives an estimate of how many errors the model has made, assuming an equal cost for each error.

Precision

It is defined by the relation: $\frac{TP}{TP + FP}$
It tells how many samples that the model classified to a certain class effectively belong to that class. It gives an estimate of the precision of the decisions the model makes.

Recall

It is defined by the relation: $\frac{TP}{TP + FN}$
The recall is the ratio of the total number of correctly classified positive examples to the total number of positive examples. A high recall indicates that the considered class is well recognized by the model. In other words, it tells how many samples of a certain class have been correctly assigned to that class.

F-Measure

It is defined by the relation: $\frac{2*Recall*Precision}{Recall+Precision}$
It is a sort of trade-off that takes into account both Recall and Precision; it measures both how good the model is at recognizing a certain class and how precise its decisions are.
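The four metrics can be verified on a toy binary example; the hand-computed formulas match scikit-learn's implementations:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Toy ground-truth labels and predictions for a binary problem
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# For binary labels, ravel() unpacks the 2x2 matrix as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f_measure = 2 * recall * precision / (recall + precision)

# sklearn's metric functions reproduce the formulas above
assert np.isclose(accuracy,  accuracy_score(y_true, y_pred))
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall,    recall_score(y_true, y_pred))
assert np.isclose(f_measure, f1_score(y_true, y_pred))
print(round(accuracy, 2), round(f_measure, 2))  # 0.7 0.67
```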

K-Fold Cross-Validation

The K-fold cross-validation technique consists in dividing the training set into K distinct folds, all containing the same number of samples; then, at each step of the training phase, the $k$-th portion becomes the validation set while all the other parts together constitute the training set.
This process is repeated K times, selecting a different portion as validation set each time. At each iteration the model is updated and some statistics about its performance are computed; at the end of the training process I calculate the average of the above-mentioned measures, obtaining a score of how good the model has been on the training set throughout the iterations.
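A minimal sketch of this procedure with scikit-learn (synthetic data in place of the phone training set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class stand-in for the training set
X, y = make_classification(n_samples=200, n_classes=4, n_informative=5,
                           random_state=0)

# 10 folds: each fold plays the role of validation set exactly once
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, cv=kf, scoring="f1_macro")

print(len(scores), round(scores.mean(), 3))  # 10 per-fold scores, averaged
```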

Tree-Based Classifiers: Decision Tree and Random Forest

In what follows, I have outlined the main working principles of these two algorithms, then trained them and tested their performance.
First, I have chosen to provide as input a version of the dataset which is not standardized, exploiting the fact that trees can handle such data; secondly, I have provided the dataset reduced by the experimental feature-selection technique based on the Random Forest's feature importances, to see how the models reacted and performed.

4.1 - Decision Tree

A decision tree is a flowchart-like tree structure where:

  • each internal node denotes a test on an attribute (a feature)
  • each branch represents an outcome of the test (a decision)
  • each leaf node holds a class label.

Tree splits can be binary or multiple, and the depth of the tree is limited. A decision tree represents a procedure for classifying data into categories based on their attributes: a new test sample goes down the tree starting from the root and following the branches towards a leaf, taking a series of decisions based on the values of the sample's attributes.

Building a decision tree:

Algorithms for constructing decision trees usually work top-down and apply a greedy strategy, choosing at each step the variable that best splits the set of items; this is done to simplify the resolution of the problem, which is otherwise NP-hard. Therefore, starting from the entire dataset, at each step the best split attribute is chosen and a new node in the tree is created, as well as two subtrees downstream (or two leaves in terminal cases).
The best split is the one that produces two (or more) output subsets whose samples belong, as much as possible, to the same class. The more a split on an attribute separates the classes, the better it is. The process terminates when a certain depth is reached or when a certain degree of purity is reached.

Purity coefficient: GINI

In order to decide the best attribute for splitting the data, I used the Gini index, which measures the degree of purity (or level of disorder) of the splits generated downstream of node $t$.

The Gini index of node 't' is defined by the relation: $GINI(t) = 1 - \sum_{j} [p(j \mid t)]^2$
$ p(j \mid t) $ refers to the relative frequency of class 'j' at node 't'.

A Gini value of zero corresponds to perfect separation, where one subset contains all samples of class #1 and the other all samples of class #2. The worst possible separation has a Gini of $ \; 1 - \frac{1}{n} \; $ (with n equal to the total number of classes): this is the case in which the classes are equally mixed within the subsets.
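As a quick sanity check of the two limit cases above, the Gini index can be computed directly from a list of class labels (a minimal sketch, not the notebook's code):

```python
import numpy as np

def gini(labels):
    """Gini index of a node: 1 - sum_j p(j)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini([0, 0, 0, 0]))  # pure node -> 0.0
print(gini([0, 1, 2, 3]))  # n = 4 classes equally mixed -> 1 - 1/4 = 0.75
```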

Good and bad points of Decision Trees:

The strengths of decision tree methods are:

  • Decision trees are able to generate understandable rules.
  • Decision trees perform classification without requiring much computation (Fast and cheap).
  • Decision trees are able to handle both continuous and categorical variables without the need to create dummy variables (Flexibility).
  • Decision trees provide a clear indication of which fields are most important for prediction or classification.

The weaknesses of decision tree methods are:

  • In general, decision trees do not have the same level of predictive accuracy compared to other algorithms.
  • Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute (but this is not the case here, because I am predicting a label).
  • Decision trees suffer from high variance: fitting two different trees on two random subsets of the same dataset can produce quite different results.
  • Finally, decision trees can be very non-robust: a small change in the data can cause a big change in the final tree.

Decision Tree Grid-Search for best max_depth value:
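A grid search like this one can be sketched as follows (the synthetic data and the candidate max_depth values are placeholders for illustration, not the actual notebook code):

```python
# Sketch of a max_depth grid search for a decision tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_classes=4, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(criterion='gini', random_state=0),
                    {'max_depth': range(2, 16)}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_['max_depth'], grid.score(X_test, y_test))
```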

Decision Tree Results:
Train_Score Test_Score
Accuracy 83.86 81.50
F1-Score 83.87 81.42
Precision 84.44 81.52
Recall 83.86 81.50

Recall and precision scores for each class
Recall Precision
Low 0.89 0.88
Medium 0.80 0.74
High 0.71 0.79
Very-high 0.86 0.85

Tree plot

Launch DT on the Random-Forest-reduced dataset

Decision Tree Results with RFreduced dataset:
Train_Score Test_Score
Accuracy 84.29 87.00
F1-Score 84.20 87.07
Precision 84.58 87.39
Recall 84.29 87.00

Recall and precision scores for each class
Recall Precision
Low 0.92 0.91
Medium 0.81 0.87
High 0.87 0.77
Very-high 0.88 0.94

4.2 - Random Forest

Bagging

As I have said before, decision trees suffer from high variance; bagging is a general-purpose technique that aims at reducing the variance of a statistical learning method. The statistical basis behind this technique is that, given $n$ independent observations $Z_1, Z_2, \dots, Z_n$, each with variance $\sigma^2$, the variance of their mean $\bar{Z}$ is equal to $\frac{\sigma^2}{n}$.
Bagging is usually employed and particularly effective in the context of decision trees.
By pooling predictions, we can incorporate much more knowledge than any individual model/tree provides, since each one brings its own background and information sources to the problem. In short, bagging constructs B decision trees using B bootstrapped training sets (generated from a single dataset) and then averages the resulting predictions $f^b$. Even if each tree has high variance, combining a large number of trees reduces the variance of the ensemble.

Bagging relation for prediction: $f_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} f^b(x)$
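A hand-rolled version of this procedure might look as follows (illustrative only; for classification the B predictions are pooled by majority vote rather than averaged, and the analysis itself relies on scikit-learn's ensemble implementations):

```python
# Minimal bagging sketch: B trees on B bootstrap samples, majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_classes=4, n_informative=6,
                           random_state=0)
rng = np.random.default_rng(0)

B = 25
trees = []
for _ in range(B):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Pool the B predictions by majority vote (the classification analogue
# of the averaging formula above)
votes = np.stack([t.predict(X) for t in trees])
y_bag = np.array([np.bincount(col).argmax() for col in votes.T])
print((y_bag == y).mean())  # training accuracy of the bagged ensemble
```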

The Random Forest algorithm introduces an improvement over bagged trees that decorrelates the trees. While the set of decision trees is built, each time a split node has to be created, a random sample of N predictors is chosen as split candidates from the entire set of P predictors. The split is then allowed to use only one of those N predictors (typically $N = \sqrt{P}$).

The novelty and great idea of Random Forest is that it forces each split node to consider only a subset of the predictors. This way, the estimators (the trees of the forest) are not all dominated by the same strong predictors of the dataset, but are obliged to also consider other, less strong predictors, thereby decorrelating the trees. From this rationale comes the model's name, Random Forest.

The main limitation of Random Forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train, but quite slow to produce predictions once trained. A more accurate prediction requires more trees, which results in a slower model.
On the other hand, thanks to its structure the Random Forest algorithm produces optimal results even with its default hyperparameters.

In summary:

  • (+) Resistant to overfitting;
  • (+) Good performance in almost all contexts;
  • (+) Fast to train;

  • (-) Slow in producing outputs and not suitable for real time predictions (if number of estimators is high);

Random Forest Grid-Search for best n_estimators value:
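The n_estimators search can be sketched like this (the candidate values and the synthetic data are assumptions for illustration):

```python
# Sketch of an n_estimators grid search for a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_classes=4, n_informative=6,
                           random_state=0)
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {'n_estimators': [25, 50, 100, 200]}, cv=5)
grid.fit(X, y)
print(grid.best_params_['n_estimators'])
```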

Random Forest Results:
Train_Score Test_Score
Accuracy 86.93 89.33
F1-Score 86.91 89.31
Precision 87.27 89.53
Recall 86.93 89.33

Recall and precision scores for each class
Recall Precision
Low 0.93 0.96
Medium 0.90 0.82
High 0.80 0.88
Very-high 0.94 0.91

Launch RF on the Random-Forest-reduced dataset

Random Forest Results with RFreduced dataset:
Train_Score Test_Score
Accuracy 90.50 91.50
F1-Score 90.46 91.50
Precision 90.81 91.54
Recall 90.50 91.50

Recall and precision scores for each class
Recall Precision
Low 0.94 0.96
Medium 0.92 0.88
High 0.86 0.88
Very-high 0.94 0.94

Tree Models performance and comments

As could be expected, the Random Forest algorithm performs better than the simple decision tree; this is absolutely reasonable, since the random forest merges the outputs of its many internal decision trees, providing a better estimation and comprehension of the classification problem.
However, Random Forest is much slower than the Decision Tree in terms of computation time, especially with a high number of estimators (like 800 for example).

Surprisingly, the Random-Forest-based method for feature selection has provided good results for both the Decision Tree and Random Forest models: their overall performances are slightly better, by around 3-4%. Hence, cutting the dataframe down to a subset of features has not only improved the scores but also lightened the computational load of the program, making it faster.
If, for some reason, the classification task must be made faster, this approach deserves serious consideration.


Distance-Based Classifiers: Support Vector Machine and K-Nearest Neighbors

In the following, I have outlined the main working principles of these two algorithms, then trained them and tested their performance.
First, I have chosen to provide the standardized dataset as input, which seemed the best choice since these two algorithms make decisions by manipulating distances; it is therefore necessary to avoid a heavy impact from features with larger absolute values.
Secondly, I have provided the dataset reduced by PCA to see how the models reacted and performed.
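Standardization followed by the PCA reduction can be sketched with a pipeline (4 components, as stated later in the text; the data here is synthetic):

```python
# Sketch of standardization + PCA reduction to 4 components.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, n_classes=4,
                           n_informative=6, random_state=0)
pipe = make_pipeline(StandardScaler(), PCA(n_components=4))
X_pca = pipe.fit_transform(X)
print(X_pca.shape)  # (300, 4)
```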

4.3 - Support Vector Machine

The Support Vector Machine (SVM) bases its classification task on the concept of a hyperplane: the model searches for the hyperplane that separates the data into groups in the best possible way.

Mathematical definition of a hyperplane in a p-dimensional space: $ f(x): \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p = 0$

Given a dataset that is linearly separable, the best hyperplane among the infinitely many solutions is the one that maximises the margin: the margin is the smallest distance from the training samples $X_i$ of each class to the hyperplane. Hence, the optimal separating hyperplane is the one furthest from the training data points $X_i$ of each class.

Given a point $X$, it can lie:

  1. Exactly on the hyperplane if $ f(x) = 0$;
  2. On the right with respect to the hyperplane [$y_i = 1$]: $ f(x) > 0 \qquad$ (red dots)
  3. On the left with respect to the hyperplane if [$y_i = -1$]: $ f(x) < 0 \qquad$ (blue dots)

In summary, M represents the margin of the hyperplane and we want to find the $\beta$ coefficients in order to maximise M.

The maximal margin hyperplane can be calculated by solving this optimization problem (Hard margin classifier):

$\max_{\beta_1, \beta_2, \dots, \beta_p} M$

such that

$\sum_{k=1}^{p} \beta_k^2 = 1$

$y_i (\beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip}) \geq M $

for each $i = 1 \dots N$

  • M is the width of the Margin;
  • Note that $y_i (\beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip}) \geq M $ guarantees that each observation is on the correct side (because M is positive) and also that each observation is outside the 'margin region' (each point is at a distance of at least M from the hyperplane).

However, as I said, this solution forces the points to stay at a distance of at least M from the hyperplane; this can be a problem for datasets whose sample distribution does not allow this constraint.
For this reason, a soft margin classifier can be created by relaxing the constraint on the minimum distance between the points and the separating hyperplane; in other words, rather than seeking the largest possible margin that perfectly separates every observation on the correct side (and outside the 'margin region'), the model allows some training observations to be misclassified in order to do a better job of classifying all the remaining observations.

Soft margin classifier optimization problem:

$\max_{\beta_1, \beta_2, \dots, \beta_p, \epsilon_1, \dots, \epsilon_n} M$

such that

$\sum_{k=1}^{p} \beta_k^2 = 1$

$y_i (\beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip}) \geq M(1-\epsilon_i) $

$\epsilon_i \geq 0$, $\sum_{i=1}^{n} \epsilon_i \leq C$

  • C is a non-negative tuning parameter which bounds the sum of the $\epsilon_i$'s, determining the number and the severity of the violations to the margin that we are willing to tolerate.
    In summary, as C increases, the model becomes more tolerant of violations to the margin, and so the margin widens. Conversely, as C decreases, the model becomes less tolerant of violations, and the margin narrows.
  • $\epsilon_1, \dots, \epsilon_n$ are called 'slack variables' and they allow individual observations to stay on the wrong side of the margin or the hyperplane.

SVM Grid-Search for best C value:
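The C grid search can be sketched as below (candidate values are assumptions). Note that scikit-learn's C is a penalty on margin violations, so it plays the inverse role of the budget C in the formulation above: a small scikit-learn C corresponds to a large budget.

```python
# Sketch of a C grid search for a linear SVM, with standardization.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_classes=4, n_informative=6,
                           random_state=0)
pipe = make_pipeline(StandardScaler(), SVC(kernel='linear'))
grid = GridSearchCV(pipe, {'svc__C': [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_['svc__C'])
```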

Linear SVM Results:
Train_Score Test_Score
Accuracy 86.93 89.83
F1-Score 86.85 89.78
Precision 87.25 89.96
Recall 86.93 89.83

Recall and precision scores for each class
Recall Precision
Low 0.99 0.98
Medium 0.75 0.85
High 0.86 0.78
Very-high 0.99 0.99

Launch SVM on PCA-transformed dataset

Linear SVM Results with PCA-transformed dataset:
Train_Score Test_Score
Accuracy 85.79 85.67
F1-Score 85.64 85.58
Precision 85.86 85.50
Recall 85.79 85.67

Recall and precision scores for each class
Recall Precision
Low 0.97 0.97
Medium 0.74 0.75
High 0.74 0.74
Very-high 0.97 0.94

4.4 - K-Nearest Neighbors

K-NN is a non-parametric, lazy learning algorithm. Its purpose is to use a database in which the data points are separated into several classes (they are labelled) to predict the classification of a new sample point.
Given a new sample $X$ to be classified, the model calculates the distance between the new point and all the others in the dataset, then it takes the $K$ nearest points and the new object is classified by a majority vote of its neighbors, so it is given the most common class among its k nearest neighbors.
K-NN is based on the concept of feature similarity and n-dimensional spatial distance between points.

K-NN is non-parametric, which means that it does not make any assumptions about the probability distribution of the input. This is useful for applications with input properties that are unknown and therefore makes k-NN more robust than algorithms that are parametric.
However, parametric machine learning algorithms tend to produce fewer errors than non-parametric ones, since taking input probabilities into account can influence decision making.

In order to calculate the distance between two points $p$ and $q$, I have used the Euclidean distance, defined as: $$d(p, q) = \sqrt{\sum_{i=1}^N (p_{i} - q_{i})^2}$$
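This distance can be computed directly on two short vectors (a minimal illustration, not the notebook's code):

```python
import numpy as np

def euclidean(p, q):
    """Euclidean distance: sqrt(sum_i (p_i - q_i)^2)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

print(euclidean([0, 0], [3, 4]))  # 5.0 (the classic 3-4-5 triangle)
print(euclidean([1, 2], [1, 2]))  # 0.0 (identical points)
```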

A key point in K-NN algorithm is the choice of the value of K:

  • A low value of K means that the algorithm could be strongly influenced by noise points or outliers;
  • A high value of K may lead the model to consider too many points and to mix the classes producing low-accurate or wrong predictions;

KNN pros:

  • K-NN does not need any training phase at all, since the model is the dataset itself;
  • It is an incremental algorithm: if new samples become available, they are simply added to the dataset and the model is updated.
  • It is simple and suitable for non-linear data;
  • K-NN does not assume any probability distributions on the input data. This can come in handy for inputs where the probability distribution is unknown and is therefore robust.

KNN cons:

  • It is computationally expensive and takes a lot of time to produce results compared to other algorithms;
  • It requires a lot of memory because it needs to store all data points and it computes the distances between the new point and all the other points each time a new object has to be classified.
  • Sensitive to dataset sample distribution. If one type of category occurs much more than another, classifying an input will be more biased towards that one category;
  • Sensitive to localized data. Since k-NN gets all of its information from the input's neighbors, localized anomalies affect outcomes significantly, rather than for an algorithm that uses a generalized view of the data.

KNN Grid-Search for best n_neighbors value:
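The n_neighbors search can be sketched as follows (standardization included, since KNN manipulates distances; the grid values and the data are assumptions, while the text later reports K = 15 as the value selected in the actual analysis):

```python
# Sketch of an n_neighbors grid search for KNN, with standardization.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_classes=4, n_informative=6,
                           random_state=0)
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(metric='euclidean'))
grid = GridSearchCV(pipe,
                    {'kneighborsclassifier__n_neighbors': range(1, 31)}, cv=5)
grid.fit(X, y)
print(grid.best_params_['kneighborsclassifier__n_neighbors'])
```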

KNN Results:
Train_Score Test_Score
Accuracy 92.07 92.33
F1-Score 92.03 92.29
Precision 92.17 92.52
Recall 92.07 92.33

Recall and precision scores for each class
Recall Precision
Low 0.96 0.97
Medium 0.95 0.87
High 0.83 0.93
Very-high 0.96 0.92

Launch KNN on PCA-transformed dataset

KNN Results with PCA-Transformed dataset:
Train_Score Test_Score
Accuracy 92.79 92.83
F1-Score 92.78 92.87
Precision 92.94 93.01
Recall 92.79 92.83

Recall and precision scores for each class
Recall Precision
Low 0.97 0.95
Medium 0.92 0.87
High 0.89 0.90
Very-high 0.92 0.99

KNN and SVM performances and comments

KNN performs very well, even better than the SVM. In any case, the SVM also delivers performances close to 90%, which are extremely good as well.
The KNN results seem almost too high; this may be due to the high value of K (K = 15) and to a particularly convenient spatial distribution of the samples in the dataset (strongly clustered, as the PCA plot has shown).

Finally, I have run the same classifiers on the PCA-transformed dataset and obtained nearly the same results on every measure for KNN, but a small decrease (around 3-4%) for SVM. This is interesting because the transformed dataset contains only 4 features, namely the 4 most important principal components obtained by PCA. So, even though the dataset dimensionality has been reduced considerably, the results are still extremely good and there is no big performance drop. In this case, PCA does a great job.


4.5 - Overall performance analysis

  • A common aspect of all the employed classifiers is that the two middle classes (Medium and High) show worse results in terms of recall and precision: this means that the models make more mistakes in recognizing them, while they are better at recognizing the classes at the extremes (Low and Very-High).

  • The classifier that works best is the KNN while the worst is the Decision Tree classifier; SVM and random forest are in the middle and offer reliable and good results as often happens.

  • I have decided to launch each kind of classifier with the version of the dataset I considered best suited to its characteristics and working principles, to make it work at its best; therefore, I compared the different classifiers even though they are trained on different versions of the same dataset.

  • As regards computation time (time to get the classification results), Random Forest is the slowest due to its high number of trees; KNN is probably a bit slower than the SVM and the Decision Tree, which provide results in a few seconds.

Here there is a table that summarizes all the results obtained:

Accuracy F-Measure Precision Recall
Decision Tree 83.83 83.75 83.80 83.83
Random Forest 89.33 89.31 89.53 89.33
SVM 89.83 89.78 89.96 89.83
K-Nearest Neighbors 92.33 92.29 92.52 92.33

5 - Conclusions

In this analysis I have deeply examined as many characteristics and behaviors of the dataset as I could; I have explored a set of different classifiers trained on the data and I have seen how they performed. Moreover, I have tried to point out the most effective features which determine the price range of a mobile phone.

Classification algorithms considerations:
Even if the best results have been obtained by the KNN model, I would suggest the Random Forest classifier as the best one for classifying a new sample into the correct price category, because in general it is a much more reliable and robust algorithm than KNN and, furthermore, Random Forest results are nearly as high as KNN's when trained on the reduced dataset.
Last, both Random Forest and KNN have the highest precision and recall scores on the middle classes (Medium and High), so they manage to do a good job where the other algorithms (Decision Tree and SVM) encounter difficulties.

Dataset's most important features in discriminating between price-categories:
ram, px_height and px_width (hence the screen resolution), and the battery power.
By analyzing only these features, you can give a quite precise estimate of the price category of the corresponding mobile phone.

In conclusion, this analysis has allowed me to get my hands dirty extracting information from a dataset and applying, in a real case, the dimensionality reduction methods and classification algorithms I have studied in theory.